Coming from Vietnam, we have gained recognition of how globally important the real estate markets are, especially in such dynamic, well-developed regions such as the Seattle Metropolitan Area. The project’s objectives include exploring housing price trends in principal cities within the area. From data analysis, we hope to deliver insights that may work for local and international investors, including those from Vietnam.
The focus will be on a dataset of property sales, examining a wide range of features that influence housing prices and transaction frequency. The features in the dataset are mentioned below:
Date: The variable represents the date when a house was sold, so understanding temporal trends will help us identify seasonal fluctuations in property prices and further timing investment decisions.
Price: This is the housing price in the USD unit.
House features: These include:
In our project analyzing house prices in the Seattle Metropolitan Area, we will answer several questions that can provide insights into buyer preferences and market trends, including transaction and house features.
We aim to explore housing transactions, both transaction frequency and transaction values, to uncover overall market trends, including weekly and monthly fluctuations in activity. Additionally, analyzing price ranges will be one of our key focuses, since this will provide insights into the stability of the housing market during the period in question.
We aim to examine the correlation among house age, price, and transaction. Besides, another factor that is in our analysis is the presence of waterfront, whether it affected customer behavior.
We will discover whether the year in which properties were built in could significantly affect the price of the houses that have similar features (such as comparable area size or number of floors).
The question aims to explore how the size of the living area and the number of floors impacted buyer preferences. Whether larger living spaces and higher floors led to higher prices is our focus.
We desire to know the ratio between bathroom and bedroom that customers preferred. The question will help us identify buyers’ priorities regarding housing functionality and livability,
To process the given dataset, we use a set of libraries, including:
readr: We use the library to read the dataset
(.csv file)dyplr: We use the library to transform and manipulate
the data, including filtering, arranging, summarizing, and grouping
variables.lubridate: The package is used in working with dates
and times more interactively. We can simply working process with time
variables thanks to the library.ggplot2
and plotly. We use ggplot2 to ensure
flexibility and visualization, whereas plotly can
supplement the interaction between charts and the audience.library(readr)
library(dplyr)
library(ggplot2)
library(plotly)
library(lubridate)
First, we read the dataset USA Housing Dataset (1).csv.
Then, we select datasets from principal cities in the Seattle
Metropolitan Area, including Seattle, Tacoma, Bellevue, Kent, Everett,
Renton, Auburn, Lakewood, Redmond (the classification is from Stastistical
Atlas) The newly set up one is named
masster_housing_city.
To better describe the house price, we used the unit “thousdand dollars” (KUSD), so we divided the initial value by 1000.
master_housing <- read_csv("USA Housing Dataset (1).csv")
# Divide the price by 1000 to use "KUSD" unit
master_housing <- master_housing %>%
mutate(price = price / 1000)
# Round the value in price to 1 decimal place
master_housing <- master_housing %>%
mutate(price = round(price, 1))
master_housing_city <- master_housing %>%
filter(city %in% c("Seattle", "Tacoma", "Bellevue", "Kent", "Everett", "Renton", "Auburn", "Lakewood", "Redmond")) %>%
group_by(city)
We summed up the transaction frequency and value to see on what day
the transaction takes place the most by adding a new column via
mutate and group_by days of the week.
# Transaction frequency per day
transactions_per_day <- master_housing_city %>%
#mutate(day = as.Date(date)) %>%
group_by(date) %>%
summarise(number_of_transactions = n())
transactions_per_day_of_week <- transactions_per_day %>%
mutate(day_of_week = weekdays(date)) %>%
group_by(day_of_week) %>%
summarise(total_transactions = sum(number_of_transactions)) %>%
mutate(day_of_week = factor(day_of_week, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))
# Transaction value per day
total_price_by_day_of_week <- master_housing_city %>%
mutate(day_of_week = weekdays(date)) %>%
group_by(day_of_week) %>%
summarise(total_price = sum(price, na.rm = TRUE))
We used as.Date and group_by to group
datasets by week, so we can see the trend and compare weekly
transactions:
# Transaction frequency per week
transactions_per_week <- transactions_per_day %>%
mutate(week = as.Date(cut(date, breaks = "week"))) %>%
group_by(week) %>%
summarise(total_transactions = sum(number_of_transactions))
# Transaction value per week
total_transaction_cost_per_week <- master_housing_city %>%
mutate(week = floor_date(date, "week")) %>%
group_by(week) %>%
summarise(total_transaction_cost = sum(price, na.rm = TRUE))
We cleaned the data of house price, including removing outliers and price equal to 0.
# Remove outliers and zero price from the dataset
Q1 <- quantile(master_housing_city$price, 0.25)
Q3 <- quantile(master_housing_city$price, 0.75)
IQR <- Q3 - Q1
# Define the lower and upper bounds for outliers
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
# Filter the dataset to remove outliers and zero price
cleaned_data <- master_housing_city %>%
filter(price > 0 & price >= lower_bound & price <= upper_bound)
#clean data for each week
data_clean_each_week <- master_housing_city %>%
mutate(week = as.Date(cut(date, breaks = "week"))) %>%
group_by(week) %>%
filter(price > 0 & price >= lower_bound & price <= upper_bound)
To easily determine type of chart, we calculated the mean and median of the dataset after being cleaned:
mean_price <- mean(cleaned_data$price)
median_price <- median(cleaned_data$price)
# Calculate median price per week
median_price_per_week <- data_clean_each_week %>%
group_by(week) %>%
summarise(median_price = median(price))
Moreover, house prices were categorized in range to see the gap more clearly.
# Categorize the price of houses
data_clean_each_week <- data_clean_each_week %>%
mutate(price_category = case_when(
price < 100 ~ "<100,000",
price >= 100 & price < 400 ~ "100,000-400,000",
price >= 400 & price < 700 ~ "400,000-700,000",
price >= 700 & price < 1000 ~ "700,000-1,000,000",
price >= 1000 ~ ">1,000,000"
))
# Count the number of houses in each price category per week
price_category_per_week <- data_clean_each_week %>%
group_by(week, price_category) %>%
summarise(count = n()) %>%
ungroup()
We calculated the average price by house age to support the analysis
by using as.numeric to calculate age specifically:
# Calculate average price by house age
stats_by_age <- cleaned_data %>%
mutate(year = as.numeric(format(date, "%Y")),
age = year - yr_built) %>%
group_by(age) %>%
summarise(avg_price = mean(price, na.rm = TRUE))
We calculated the number of transactions for each house age and removed outliers:
transaction_counts <- master_housing %>%
mutate(age = as.numeric(format(Sys.Date(), "%Y")) - yr_built) %>%
group_by(age) %>%
summarise(transaction_count = n())
Q1 <- quantile(transaction_counts$transaction_count, 0.25)
Q3 <- quantile(transaction_counts$transaction_count, 0.75)
IQR <- Q3 - Q1
transaction_counts <- transaction_counts %>%
filter(transaction_count >= (Q1 - 1.5 * IQR) & transaction_count <= (Q3 + 1.5 * IQR))
The average house prices were calculated for waterfront and non-waterfront properties (outliers were removed):
# Calculate the average house price for waterfront and non-waterfront properties
waterfront_prices <- master_housing %>%
group_by(waterfront) %>%
summarise(average_price = mean(price, na.rm = TRUE))
# Remove outliers using IQR method
Q1 <- quantile(waterfront_prices$average_price, 0.25)
Q3 <- quantile(waterfront_prices$average_price, 0.75)
IQR <- Q3 - Q1
waterfront_prices <- waterfront_prices %>%
filter(average_price >= (Q1 - 1.5 * IQR) & average_price <= (Q3 + 1.5 * IQR))
We calculated the transaction distribution based on house conditions and waterfront before removing outliers.
# Calculate the transaction distribution based on houses' condition (1-5 level) and waterfront
condition_distribution <- master_housing %>%
group_by(condition, waterfront) %>%
summarise(transaction_count = n())
# Remove outliers using IQR method for each condition and waterfront combination
condition_distribution <- condition_distribution %>%
group_by(condition) %>%
mutate(Q1 = quantile(transaction_count, 0.25),
Q3 = quantile(transaction_count, 0.75),
IQR = Q3 - Q1) %>%
filter(transaction_count >= (Q1 - 1.5 * IQR) & transaction_count <= (Q3 + 1.5 * IQR)) %>%
ungroup() %>%
select(-Q1, -Q3, -IQR)
The average price grouped by house conditions was calculated (removing outliers).
# Calculate the average house price by house condition
avg_price_by_condition <- master_housing %>%
group_by(condition) %>%
summarise(avg_price = mean(price, na.rm = TRUE))
# Remove outliers using IQR method for each condition
avg_price_by_condition <- avg_price_by_condition %>%
mutate(Q1 = quantile(avg_price, 0.25),
Q3 = quantile(avg_price, 0.75),
IQR = Q3 - Q1) %>%
filter(avg_price >= (Q1 - 1.5 * IQR) & avg_price <= (Q3 + 1.5 * IQR)) %>%
select(-Q1, -Q3, -IQR)
We divided price range into categories and filter data for selected years:
# Define price categories
master_housing <- master_housing %>%
mutate(price_category = case_when(
price < 100 ~ "<100",
price >= 100 & price < 400 ~ "100-400",
price >= 400 & price < 700 ~ "400-700",
price >= 700 & price < 1000 ~ "700-1,000",
price >= 1000 ~ ">1,000"
))
# Filter the data for the years 1954, 1974, 1994, 2014
filtered_years <- master_housing %>%
filter(yr_built %in% c(1954, 1974, 1994, 2014))
We categorized price range with houses having 1 floor and 2 floors, two most common number of floors.
# Filter the data for houses built in 1954, 1974, 1994, and 2014 with 1 floor
filtered_data <- master_housing %>%
filter(yr_built %in% c(1954, 1974, 1994, 2014) & floors == 1)
# Categorize the price of houses
filtered_data <- filtered_data %>%
mutate(price_category = case_when(
price < 100 ~ "<100",
price >= 100 & price < 400 ~ "100-400",
price >= 400 & price < 700 ~ "400-700",
price >= 700 & price < 1000 ~ "700-1000",
price >= 1000 ~ ">1000"
))
# Count the number of houses in each price category per year built
price_category_per_year_built <- filtered_data %>%
group_by(yr_built, price_category) %>%
summarise(count = n()) %>%
ungroup()
# Filter the data for houses built in 1954, 1974, 1994, and 2014 with 2 floors
filtered_data_2_floors <- master_housing %>%
filter(yr_built %in% c(1954, 1974, 1994, 2014) & floors == 2)
# Categorize the price of houses
filtered_data_2_floors <- filtered_data_2_floors %>%
mutate(price_category = case_when(
price < 100 ~ "<100",
price >= 100 & price < 400 ~ "100-400",
price >= 400 & price < 700 ~ "400-700",
price >= 700 & price < 1000 ~ "700-1,000",
price >= 1000 ~ ">1,000"
))
# Count the number of houses in each price category per year built
price_category_per_year_built_2_floors <- filtered_data_2_floors %>%
group_by(yr_built, price_category) %>%
summarise(count = n()) %>%
ungroup()
We filtered data for specific time milestone to compare, categorized square of living, and calculated the average price.
# Filter the data for houses built in 1954, 1974, 1994, and 2014
filtered_data_sqft_living <- master_housing %>%
filter(yr_built %in% c(1954, 1974, 1994, 2014))
# Categorize the sqft_living of houses
filtered_data_sqft_living <- filtered_data_sqft_living %>%
mutate(sqft_living_category = case_when(
sqft_living < 1000 ~ "<1000",
sqft_living >= 1000 & sqft_living < 1500 ~ "1000-1500",
sqft_living >= 1500 & sqft_living < 2000 ~ "1500-2000",
sqft_living >= 2000 & sqft_living < 2500 ~ "2000-2500",
sqft_living >= 2500 ~ ">2500"
))
# Calculate the average price of houses in each sqft_living category per year built
avg_price_sqft_living_category_per_year_built <- filtered_data_sqft_living %>%
group_by(yr_built, sqft_living_category) %>%
summarise(avg_price = mean(price, na.rm = TRUE)) %>%
ungroup()
We calculated how many houses per floor.
# Summarize the count of houses per number of floors
floors_count <- master_housing_city %>%
group_by(floors) %>%
summarise(count = n())
# Summarize the count of houses per number of floors per city
floors_count_city <- master_housing_city %>%
group_by(city, floors) %>%
summarise(count = n())
And we compared houses with one and two floors in each city in research.
# Combine the data for 1-floor and 2-floor houses
combined_houses <- master_housing_city %>%
filter(floors == 1 | floors == 2) %>%
mutate(floor_type = ifelse(floors == 1, "1-Floor", "2-Floor"))
# Summarize the median price of houses per city and floor type
combined_houses_price_city <- combined_houses %>%
group_by(city, floor_type) %>%
summarise(median_price = median(price, na.rm = TRUE))
We categorized square footage of living and counted how many houses were within each range.
living_ranges <- c(0, 1000, 2000, 3000, 4000, 5000, 7000, 9000, 10000, Inf)
living_labels <- c("0-1000", "1000-2000", "2000-3000", "3000-4000", "4000-5000", "5000-7000", "7000-9000", "9000-10000", "10000+")
# Create a new column in the dataframe for the lot range categories
master_housing_city <- master_housing_city %>%
mutate(living_range = cut(sqft_living, breaks = living_ranges, labels = living_labels, right = FALSE))
# Calculate the frequency of houses in each lot range
living_range_frequency <- master_housing_city %>%
group_by(living_range) %>%
summarise(Frequency = n())
median_price_by_living_range <- master_housing_city %>%
group_by(living_range) %>%
summarise(Median_Price = median(price, na.rm = TRUE))
We divided the number of baths by the bedrooms, calculated average price, and removed rows with N/A values.
# Calculate bath/bed ratio
master_housing_city <- master_housing_city %>%
mutate(bath_bed_ratio = bathrooms / bedrooms)
# Create bins for the bath/bed ratio
master_housing_city <- master_housing_city %>%
mutate(bath_bed_bin = cut(bath_bed_ratio,
breaks = seq(0, max(master_housing_city$bath_bed_ratio, na.rm = TRUE), by = 0.2),
include.lowest = TRUE,
right = FALSE))
# Calculate average price for each bath/bed ratio bin
average_price_by_bin <- master_housing_city %>%
group_by(bath_bed_bin) %>%
summarize(average_price = mean(price, na.rm = TRUE))
# Remove rows with NA values in the bath_bed_bin column in the average_price_by_bin dataframe
average_price_by_bin <- average_price_by_bin %>%
filter(!is.na(bath_bed_bin))
transactions_per_day_of_week_plot <- ggplotly(ggplot(transactions_per_day_of_week, aes(x = day_of_week, y = total_transactions)) +
geom_bar(stat = "identity", color = "blue", fill = "skyblue") +
labs(title = "Total Number of Transactions per Day of the Week",
x = "Day of the Week",
y = "Total Number of Transactions") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 15,
face = "bold")
))
transactions_per_day_of_week_plot
Figure 3.1.1.1. Transaction frequency of principal cities in Seattle Metropolitan Area per days of the week.
transactions_per_week_plot <- ggplot(transactions_per_week, aes(x = week, y = total_transactions)) +
geom_bar(stat = "identity", color = "blue", fill = "skyblue") +
geom_line(color = "red", size = 1) +
labs(title = "Total Number of Transactions per Week",
x = "Week",
y = "Total Number of Transactions") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 15,
face = "bold")
)
transactions_per_week_plot <- ggplotly(transactions_per_week_plot)
transactions_per_week_plot
Figure 3.1.1.2. Transaction frequency principal cities in Seattle Metropolitan Area per week.
total_price_by_day_of_week$day_of_week <- factor(total_price_by_day_of_week$day_of_week,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))
transaction_values_per_day_plot <- ggplotly(ggplot(total_price_by_day_of_week, aes(x = day_of_week, y = total_price)) +
geom_bar(stat = "identity", color = "blue", fill = "skyblue") +
labs(title = "Transaction Values per Day of the Week",
x = "Day of the Week",
y = "Total Price (KUSD)") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 15,
face = "bold")
))
transaction_values_per_day_plot
Figure 3.1.2.1. Transaction values of principal cities in Seattle Metropolitan Area per days of the week (KUSD).
total_transaction_cost_plot <- ggplotly(ggplot(total_transaction_cost_per_week, aes(x = week, y = total_transaction_cost)) +
geom_bar(stat = "identity", color = "blue", fill = "skyblue") +
geom_line(color = "red", size = 1) +
labs(title = "Transaction Values per Week",
x = "Week",
y = "Total price (KUSD)") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 15,
face = "bold")
))
total_transaction_cost_plot
Figure 3.1.2.2. Transaction values of principal cities in Seattle Metropolitan Area per week (KUSD).
histogram_price <- ggplotly(ggplot(cleaned_data, aes(x = price)) +
geom_histogram(aes(y = ..density..), binwidth = 30, color = "black", fill = "skyblue") +
geom_density(color = "red", size = 1) +
geom_vline(aes(xintercept = mean_price), color = "blue", linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = median_price), color = "green", linetype = "dashed", size = 1) +
labs(title = "Histogram of Housing Prices (Without Outliers and Zero Price)",
x = "Price (KUSD)",
y = "Density") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
histogram_price
Figure 3.1.3.1. Distribution of house prices (without outliers and prices equal to 0).
price_range_per_week_plot <- ggplot(data_clean_each_week, aes(x = week, y = price)) +
geom_boxplot(fill = "skyblue", color = "blue", outlier.shape = NA) +
geom_line(data = median_price_per_week, aes(x = week, y = median_price), color = "red", size = 1) +
labs(title = "Price Range per Week (Without Outliers) and Median Prices",
x = "Week",
y = "Price (KUSD)") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold"))
price_range_per_week_plot <- ggplotly(price_range_per_week_plot)
price_range_per_week_plot
Figure 3.1.3.2. House Price Ranges (without outliers and prices equal to 0) by week.
price_category_per_week_plot <- ggplot(price_category_per_week, aes(x = week, y = count, fill = price_category)) +
geom_bar(stat = "identity") +
labs(title = "Number of Houses in Each Price Category per Week",
x = "Week",
y = "Number of Houses",
fill = "Price Category (KUSD)") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold"))
price_category_per_week_plot <- ggplotly(price_category_per_week_plot)
price_category_per_week_plot
Figure 3.1.3.3. House Price Categories (without outliers and prices equal to 0) by week.
house_age_price <- ggplotly(ggplot(stats_by_age) +
geom_line(aes(x = age, y = avg_price, color = "Average Price")) +
geom_smooth(aes(x = age, y = avg_price, color = "Trend"), method = "loess", se = FALSE) +
labs(title = "Average House Price by Age",
x = "Age of House (Years)",
y = "Average Price (KUSD)") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")) +
scale_color_manual(values = c("Average Price" = "blue", "Trend" = "red")))
house_age_price
Figure 3.2.2.1. The relationship between house age (years) and price (KUSD).
house_age_transaction_frequency <- ggplotly(ggplot(transaction_counts, aes(x = age, y = transaction_count)) +
geom_smooth(method = "loess", size = 1, color = "blue", se = FALSE) +
labs(title = "Number of Transactions by House Age",
x = "House Age (years)",
y = "Number of Transactions") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
house_age_transaction_frequency
Figure 3.2.1.2. The relationship between house age (years) and transaction frequency.
waterfront_price <- ggplotly(ggplot(waterfront_prices, aes(x = factor(waterfront), y = average_price, fill = factor(waterfront))) +
geom_bar(stat = "identity") +
labs(title = "Average House Price: Waterfront vs Non-Waterfront (KUSD)",
x = "Waterfront (0 = No, 1 = Yes)",
y = "Average House Price (KUSD)") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
waterfront_price
Figure 3.2.2.1. Comparison of house prices (KUSD) between houses with and without waterfront.
condition_wtf_transaction <- ggplotly(ggplot(condition_distribution, aes(x = factor(condition), y = transaction_count, fill = factor(waterfront))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Transaction Count Based on Houses' Condition and Waterfront",
x = "Condition (1 = Poor, 5 = Excellent)",
y = "Transaction Count",
fill = "Waterfront (0 = No, 1 = Yes)") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
condition_wtf_transaction
Figure 3.2.2.2. Transaction frequency based on House Conditions and Waterfront.
house_condition_price <- ggplotly(ggplot(avg_price_by_condition, aes(x = factor(condition), y = avg_price)) +
geom_line(group = 1, color = "blue") +
geom_point(color = "red") +
geom_smooth(method = "loess", se = FALSE, color = "green") +
labs(title = "Average House Price (KUSD) by House Condition",
x = "Condition (1 = Poor, 5 = Excellent)",
y = "Average Price (KUSD)") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 12,
face = "bold")))
house_condition_price
**Figure 3.2.2.3. The relationship between average house price (KUSD) and conditions.
ggplotly(ggplot(filtered_years, aes(x = factor(yr_built), fill = price_category)) +
geom_bar(position = "fill") +
labs(title = "Proportion of House Prices for Selected Years by Price Category (KUSD)",
x = "Year Built",
y = "Proportion",
fill = "Price Category") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")) +
scale_fill_brewer(palette = "Set3") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)))
Figure 3.3.3.1. Proportion of House Prices for Selected Years by Price Category (KUSD).
ggplotly(ggplot(price_category_per_year_built, aes(x = factor(yr_built), y = count, fill = price_category)) +
geom_bar(stat = "identity") +
labs(title = "Number of Houses in Each Price Category by Year Built - 1 Floor (KUSD)",
x = "Year Built",
y = "Number of Houses",
fill = "Price Category") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
Figure 3.3.3.2. Number of Houses in Each Price Category by Year Built - 1 Floor (KUSD).
ggplotly(ggplot(price_category_per_year_built_2_floors, aes(x = factor(yr_built), y = count, fill = price_category)) +
geom_bar(stat = "identity") +
labs(title = "Number of Houses in Each Price Category by Year Built - 2 Floors",
x = "Year Built",
y = "Number of Houses",
fill = "Price Category") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
Figure 3.3.3.2.1. Number of Houses in Each Price Category by Year Built - 2 Floors (KUSD)
ggplotly(ggplot(avg_price_sqft_living_category_per_year_built, aes(x = factor(yr_built), y = avg_price, fill = sqft_living_category)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Average House Price by Sqft Living Category and Year Built (KUSD)",
x = "Year Built",
y = "Average Price",
fill = "Sqft Living Category") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
Figure 3.3.3.3. Average House Price by Sqft Living Category and Year Built (KUSD).
# Filter the data for the years 1954, 1974, 1994, 2014
filtered_years <- master_housing %>%
filter(yr_built %in% c(1954, 1974, 1994, 2014))
# Create the interactive boxplot
ggplotly(ggplot(filtered_years, aes(x = factor(yr_built), y = price, fill = factor(waterfront, labels = c("No waterfront", "Waterfront")))) +
geom_boxplot() +
labs(title = "House Prices (KUSD) for Selected Years Grouped by Waterfront Status",
x = "Year Built",
y = "Price (KUSD)",
fill = "Waterfront") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")) +
scale_fill_brewer(palette = "Set3") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
)
Figure 3.3.3.4. House Prices (KUSD) for Selected Years Grouped by Waterfront Status.
ggplot(floors_count, aes(x = as.factor(floors), y = count)) +
geom_bar(stat = "identity", fill = "blue", color = "red") +
geom_text(aes(label = count), vjust = -0.5) +
labs(title = "Frequency of Houses by Number of Floors",
x = "Number of Floors",
y = "Number of Houses") +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5)) +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold"))
Figure 3.2.4.1. The relationship between house distribution and number of floors.
ggplotly(ggplot(combined_houses_price_city, aes(x = reorder(city, -median_price), y = median_price, fill = floor_type)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Median Price (KUSD) of 1-Floor vs 2-Floor Houses in Each City",
x = "City",
y = "Median Price (KUSD)",
fill = "Floor Type") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
Figure 3.2.4.1.1. Comparison of Median Price of Houses with 1 and 2 floors in Seattle Metropolitan Area cities.
ggplotly(ggplot(living_range_frequency, aes(x = living_range, y = Frequency)) +
geom_bar(stat = "identity", fill = "pink", color= "blue") +
labs(title = "Frequency of Houses by Living Range",
x = "Living Range (sqft)",
y = "Frequency") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
Figure 3.2.4.2. House Distributions by Square Footage of Living.
ggplotly(ggplot(median_price_by_living_range, aes(x = living_range, y = Median_Price)) +
geom_bar(stat = "identity", fill = "pink", color = "blue") +
labs(title = "Median House Price by Living Range (KUSD)",
x = "Living Range (sqft)",
y = "Median Price (KUSD)") +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
Figure 3.2.4.3. Median House Price (KUSD) by Square Footage of Living (sqft).
ggplotly(ggplot(average_price_by_bin, aes(x = bath_bed_bin, y = average_price)) +
geom_bar(stat = "identity", fill = "pink", width = 0.7) +
geom_smooth(aes(group = 1), method = "lm", se = FALSE, color = "dark blue", linetype = "dashed") +
labs(title = "Average House Price by Bath/Bed Ratio Bin with Trendline",
x = "Bath/Bed Ratio Bin",
y = "Average House Price") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 13)) +
theme(
plot.title = element_text(hjust = 0.5,
color = "blue",
size = 10,
face = "bold")))
Figure 3.2.4.4. Average House Price by Bath/Bed Ratio Bin with Trendline.
Note.
0.25 Bathroom: Usually just a toilet (sometimes with a sink), no shower or bathtub.
0.5 Bathroom: Contains a toilet and sink, but no shower or bathtub.
1.0 Bathroom: A full bathroom with a toilet, sink, and either a shower or bathtub.
We can explain the trends for transaction frequency and transaction values during the period for several reasons. Firstly, the transaction took place the most on Tuesdays and other weekdays, apart from a drop on Fridays. This may be because buyers and sellers want to avoid delays when rectifying anything going wrong in the purchasing process (DeMarco, 2023); therefore, they tried to lower Friday transactions. On weekends, because these are not business days, not many services and banks are active, leading to a negligible of transactions. Thus, the implementation of the purchasing process and weekend inquiries will be moved to next Monday, which can explain their lower transaction frequency and values compared to Tuesdays. Secondly, transaction frequency and values peaked in the final week of May and June. A potential explanation for the observation would be lower interest savings from additional days before going to the upcoming months and better price deals because sellers tend to close on their houses by the end of the month. (Frame, 2015) The high number and volume of transactions recorded in June, were the same as what was reported in the National Association of Realtors (Nadia, 2019) or NJLux (2024). This tendency was because of the increasing demand, including families resettling for children before a new academic year and warmer weather conditions compared to winter.
For house price distribution, the consistent price range throughout the period may imply the stability of the housing market or the predictability of customer behavior. The stability was similar to the prediction compiled by The Week (2014). This also accentuates the correlation between the trends for housing’s transaction frequency and values. Regarding the weekly price range observed, a slight drop in the range of the final week of June could be reasoned by the aforementioned sellers’ striving to deal with the prices.
After finding that single-storey houses used to be preferred pre-1974 but the trend shifted immediately when two-storey houses were introduced after 1974. It can be analyzed that pre-1974 houses were often small and single-story as they might have been built for post-war families that had small incomes. The Ranch-style house is one common example that was extremely popular due to the booming post-war middle class of the 1940s to the 1970s (Salant, 2006). After 1974, house demand continued rising but the decline of land availability led to the popularity of two-story houses to optimize living areas. The Ranch-style lost its popularity was also partly due to the suburban area.
Also, the fact that older houses tended to have more fluctuating prices and were more costly due to the design they followed and the unique historical values they held. This could be explained by the historical significance and unique design of old houses that commanded more price, and it had to be big to follow the construction style (such as Queen Anne style).
Lastly, speaking of the location of houses, most houses with waterfront views were built in 1974 but not others. This might be explained by the increase in income and growth of recreational lifestyle after the war during the 1970s. A surge of middle and upper-middle class were also able to afford beachfront houses. Most importantly, the waterfront development of the government to adapt this space had also attracted people to construct houses here (Morehouse, 1981). There is no paper or common sense to back it up, and it might be attributed to the non-representative sample of the data.
Speaking of floors of houses, people tended to prefer houses with either one floor or two floors. People prefer these types due to their cost efficiency, ease of maintenance, and better accessibility for families with children or elderly members. Additionally, these homes offer practical space optimization, particularly in urban areas with limited land and higher construction costs, while still providing energy efficiency and privacy. (Booth (2022)), (Montoya (2021)). The obvious trend that two-story houses were more expensive than single-story houses suggests that the additional living space and functionality provided by an extra floor consistently adds value to homes, leading to higher prices. Especially in luxurious and expensive cities, the gap was more significant.
Regarding house sizes, the buying behavior in which people tend to make efficient-cost purchases could also be a justification for the popularity of mid-sized houses. These sizes are likely ideal for families and fit well within budget and space constraints (Booth (2022)). Also, the positive correlation between the house’s size and the price can also be confirmed through our findings. Thus, square footage significantly influences house prices, with larger homes commanding higher prices due to their added space, features, and appeal to wealthier buyers.
Properties with a low bath/bed ratio are generally perceived as less desirable, leading to lower prices. Buyers often find homes with insufficient bathrooms relative to bedrooms inconvenient, which diminishes their appeal and market value. However, a ratio of 1—one bathroom for every bedroom—represents an ideal setup for family living, enhancing overall desirability and market value. Conversely, when the ratio exceeds 1.6, there is a dramatic drop in prices. This indicates an excess of bathrooms relative to bedrooms, which buyers may perceive as unnecessary. While additional bathrooms can be seen as a benefit, the market shows diminishing returns in value for properties with more bathrooms than needed, leading to reduced buyer interest and lower prices.
Younger houses tended to have higher average prices often due to modern features, energy efficiency, and lower maintenance needs. According to Realtor.com, “Newly built homes are often priced higher, but builders frequently offer financial incentives such as lower interest rates or help with closing costs, making them attractive options for buyers”. (Neumann, 2023). Within the range of 40-80 years old, the price decreased, and this can be attributed to the higher likelihood of needing updates or renovations, which can deter buyers. According to Realtor.com, “older homes in need of repairs often face lower market value, especially if major systems like plumbing or electrical wiring need to be replaced”. And lastly, houses that were more than 80 years old (which were built before the 1930s, however, saw a growth in price. This increase can be linked to historical value, architectural uniqueness, and well-preserved characteristics. According to Realtor.com, this trend might be explained by the belief that an old house tends to be more well-established and probably in a stable area.” (Neumann, 2023)
Also, newer homes typically have a higher number of transactions, which aligns with buyers’ preferences for modern amenities and lower maintenance costs. Newer homes often sell faster because they have contemporary designs, and newer appliances, and tend to require fewer immediate repairs. However, the older houses were not a very common choice. This decline may be due to older homes often needing more maintenance or upgrades, which can deter potential buyers. Older homes are often considered less desirable due to maintenance concerns, outdated designs, and the need for renovations, which contributes to a reduced volume of sales. In some cases, however, historically significant or well-preserved homes may attract a niche group of buyers who value architectural uniqueness, which could lead to some older homes still seeing sales activity despite the general decline.
In terms of luxurious properties, waterfront properties tend to have a significant premium over non-waterfront homes, often driven by factors like the quality of water access, views, and overall location. According to Vincent Diaz from Vaster (2014), “The answer lies in three key aspects: the unique aesthetic appeal, the recreational opportunities, and the exclusivity that waterfront living offers”.
Additionally, waterfront properties are seen as a limited commodity, which adds to their price premium. The quality of the water, its depth, and clarity, as well as the property’s accessibility to the water, also play important roles in determining value. Homes with direct access to the water or those with private docks tend to fetch higher prices because they provide greater recreational convenience. Moreover, environmental factors such as the risk of flooding or erosion can affect the long-term value of these properties. Buyers must weigh these risks against the inherent.
Regarding house condition, people tend to purchase houses that are in average condition (levels 3 and 4) as they are not too bad to live in or also not too expensive to afford as in level 5. As for levels 1 and 2, we can predict that their future trends will also stay the same as it is not good to purchase a house where you need to fix nearly everything before moving in.
Additionally, the pandemic has accelerated this trend, with more people seeking open space and views, leading to an increased demand for waterfront living. Knight Frank (2022) has reported that “The pandemic and successive lockdowns have put greenery and space at the top of buyers’ requirements for homes, fuelling sales and price growth in prime regional markets”. According to (Griffith, 2024), houses that underwent renovation or improvements, especially in kitchens and bathrooms, experienced a considerable price boost, reflecting buyers’ desire for move-in-ready homes. Houses in poor condition, by contrast, may struggle to sell at market value, as potential buyers factor in the cost of necessary repairs. The relationship between condition and price is particularly important in cities like Seattle, where housing inventory is limited, and buyers are willing to pay more for homes that are well-maintained or recently renovated.
While working with the dataset, having acknowledged that it is a sample for studying, there were some limitations from the dataset that hindered the process of processing and analysing data to compute meaningful findings.
The dataset contained a small sample. It only reported house purchases in three months of 2014 → could not fully predict the buying patterns across the years. Therefore, the findings could not be 100% guaranteed that they could generalize for the population. (only 2474 out of around 123.23 million). (Korhonen, 2024)
Some variables exhibited in this dataset are not thoroughly and carefully explained - leading to ambiguity in the way we understand the variable. For example, regarding the condition, how was this data collected? And to what extent were they reliable (because we see a paradox where houses built in the same year, with comparable features, renovated houses are of lower condition than none).
While the sample size is already small, there are also null values within the data set. Although we managed to rule out the null values (N/A), it still might hinder researchers like us from understanding the broad picture of the real estate market in the USA in three months in 2014.
Interesting findings such as peak transaction volumes seen on weekdays instead of our expected weekend could not be solidified by trustworthy papers - we could only give some possible explanations. However, we believed that some sort of information traced back to past events somehow made our findings meaningful.
Researchers analysing this dataset (we) do not live in the USA so we might do have a comprehensive understanding of the real estate market, which would result in the negligibility of some important parameters or specific nuances that affect the relationship between house prices and house features.
The process of sampling the dataset and self-selection of some data for our analysis might be driven by our motivation to find meaningful findings. This unconsciously biased sampling method data selection and and it might not be representative of the entire US housing market.
Most researchers’ understanding of variables is based on the brief description given from the dataset rather than the deep understanding of each variable. Therefore, we might also neglect some underlying factors or essential relationships between them.
In conclusion, with the dataset about USA House Price in the context of recording house transactions in three months - May, June, and July of 2014 - we saw some interesting trends. Through exploring and processing the data, we chose meaningful variables and figured out the connection between them to find the answer to our research questions surrounding the dataset. We were curious about the customer behavior pattern in purchasing houses in the Seattle Metropolitan Area from May to July 2014 through examining the transaction (values and trend in time), and specific house features such as size, amenities, and location.
One of the observations was a positive correlation between house price and special amenities/features: house size, floors, bathroom, bedroom, view, condition, and house age. Also, different periods might reflect different preferences in building houses, which later affect buying behaviors. After analyzing data, we also understood the psychology of buyers when buying a home, as they tended to maximize the capacity and optimize their money - which can be seen through the popularity of houses with optimal ratio of bathroom/bedroom; or buying young houses to avoid complicated renovation. Another novel point about transactions was that people tended to make transactions on weekdays rather than weekends (which was contrary to our thoughts), and this could be explained by the fact that paperwork relating to transactions was processed during business hours.
From the findings, we suggest that it could be useful for both home buyers and home sellers. Homebuyers could maximize their investment by focusing more on efficient layouts, such as on floors or ratio of rooms to ensure a living experience while getting the best return on investment. Also, sellers might get insight into what house features could be attractive to buyers so that they could highlight those features and strategically build a price plan based on the condition and amenities of the house, such as the location of a house (with a waterfront view). In general, we believe that people who are interested in investing or selling real estate should understand the condition of the market and make data-driven decisions; as it would lead to the most beneficial outcomes for all parties.
Booth, S. (2022, October 26). One-Story vs. Two-Story Home: Which Is Better? Realtor.com. Retrieved October 19, 2024, from https://www.realtor.com/advice/buy/one-story-vs-two-story-home-which-is-better/
Diaz, V. (2024, July 8). How much does waterfront add to property value?. Market Insights. https://blog.vaster.com/how-much-does-waterfront-add-to-property-value
DeMarco, J. (2023, November 15). The worst day to close on a house - orchard. RSS. https://orchard.com/blog/posts/worst-day-to-close-on-a-house
Frame, T. (2015, June 4). The advantages of buying a home at the end of the Month. The Seattle Times. https://www.seattletimes.com/business/real-estate/the-advantages-of-buying-a-home-at-the-end-of-the-month/
Griffith, K. (2024, September 26). The home renovations that will boost your listing price the most. Real Estate News & Insights | realtor.com®. https://www.realtor.com/advice/sell/home-renovations-boost-listing-price/
How does the summer season influence the housing market?. NJ Lux Real Estate. (2024, June 10). https://www.njlux.com/how-does-the-summer-season-influence-the-housing-market/
Korhonen, V. (2024, July 5). U.S.: Number of households 1960-2023. Statista. https://www.statista.com/statistics/183635/number-of-households-in-the-us/?fbclid=IwZXh0bgNhZW0CMTEAAR14hJDTd0NJ3s6GCLfL_GfMBZFrR3I5HImRZLEgg-E6mKrpPGR1Xu1uiMw_aem_kJwcYUfYr04aB6KZ9_CsMw
Montoya, K. (2021, March 8). One-Story vs. Two-Story Homes: How to Decide on the Right Floor Plan. Adair Homes Blog. Retrieved October 19, 2024, from https://blog.adairhomes.com/homebuilding-tips/one-story-vs.-two-story-homes-how-to-decide-on-the-right-floor-plan
Morehouse, W. (1981, April 24). American cities rediscover their historic waterfronts - The Washington Post. The Washington Post. https://www.washingtonpost.com/archive/realestate/1981/04/25/american-cities-rediscover-their-historic-waterfronts/787096bb-f6e0-42d6-a354-7952f5976da0/
Nadia, E. (2019, January 2). Seasonality in the housing market. National Association of REALTORS®. https://www.nar.realtor/blogs/economists-outlook/seasonality-in-the-housing-market
Neumann, K. D. (2023, June 13). The weird reason why brand-new homes are now cheaper than old ones-and where they are hiding. Real Estate News & Insights | realtor.com®. https://www.realtor.com/advice/buy/how-to-find-a-brand-new-house-that-costs-less-than-an-existing-home/
Overview of the Seattle area, Washington (metro area). The Demographic Statistical Atlas of the United States - Statistical Atlas. (n.d.). https://statisticalatlas.com/metro-area/Washington/Seattle/Overview#google_vignette
Staff, T. W. (2014, January 7). A quick guide to real estate in 2014. theweek. https://theweek.com/articles/453581/quick-guide-real-estate-2014
Salant, K. (2006, December 29). The Ranch, an architectural archetype forged on the frontier - The Washington Post. https://www.washingtonpost.com/archive/realestate/2006/12/30/the-ranch-an-architectural-archetype-forged-on-the-frontier/e92d7b5b-78d7-4f7f-aa5b-68eff176ca60/
Which Global City generates the highest waterfront property price premium?. Knight Frank. (2022, August 9). https://www.knightfrank.com/research/article/2022-08-09-which-global-city-generates-the-highest-waterfront-property-price-premium